Cost Effective Dependency Parsing for Indian Languages
Abstract
Indian languages are morphologically rich and have free word order (MoR-FWO), and hence differ from English in structure and morphology. They possess many distinguishing characteristics, and when working with these languages we have to keep these characteristics in mind and plan strategies accordingly. We worked on improving dependency parsing for Indian languages, specifically for Hindi, an Indo-Aryan language. Conventional dependency parsing work has focused on developing robust data-driven parsing techniques. This prompted efforts to create large hand-annotated treebanks, which serve as training input for data-driven parsers. The annotations in Indian-language treebanks are generally multi-layered and furnish information on the part-of-speech category of word forms, their morphological features, related word groups, and syntactic relations. To improve parsers, ever richer features are being added. Manual annotation is expensive because it requires a great deal of human effort, and creating treebanks for all languages is a tedious task. Even when treebanks are available, a real-time scenario requires many tools to extract the features automatically, and building such tools is itself a complex task. We are in an era of almost unlimited access to raw data, yet we often struggle to make sense of most of it. Much of this data is unlabeled and thus useless in many traditional supervised machine learning scenarios, which require explicitly labeled, hand-annotated examples. In this work, we present our efforts towards exploring cost-effective approaches for building and improving parsers for resource-poor languages. For this purpose, we try to use unsupervised techniques to extract features from the widely available monolingual raw corpus.
Using cross-lingual treebank transfer, we exploit the treebanks available for other languages and, with techniques such as machine translation (MT), try to generate a treebank for the target language. This treebank can then be used to train a parser. We first try this approach for Hindi. An important constraint on this approach is that the treebank annotation needs to be similar cross-linguistically. For this, we use the Universal Dependencies (UD) framework. Universal Dependencies is an initiative to create cross-linguistically consistent treebank annotation for many languages, with the goal of ...
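As a rough illustration of what treebank transfer can look like, the sketch below projects dependency arcs from a source-language sentence onto its translation through word alignments. This is an assumption for illustration only: the abstract mentions machine translation but does not say how annotations are carried over, so the function name `project_tree`, the one-to-one alignment format, and the toy English-Hindi sentence pair are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# One arc per token: (head index, relation label); head 0 is the artificial root.
Arc = Tuple[int, str]


def project_tree(source_arcs: Dict[int, Arc],
                 alignment: Dict[int, int],
                 target_len: int) -> List[Optional[Arc]]:
    """Copy source arcs onto aligned target tokens (hypothetical helper).

    source_arcs: 1-based source token index -> (head index, UD relation).
    alignment:   1-based source index -> 1-based target index, assumed one-to-one.
    Returns a list over target tokens (position i-1 holds token i);
    None marks target tokens that received no projected arc.
    """
    projected: List[Optional[Arc]] = [None] * target_len
    for src_dep, (src_head, rel) in source_arcs.items():
        tgt_dep = alignment.get(src_dep)
        if tgt_dep is None:
            continue  # the dependent has no aligned target token
        if src_head == 0:
            tgt_head = 0  # the root stays the root
        else:
            tgt_head = alignment.get(src_head)
            if tgt_head is None:
                continue  # the head has no aligned target token
        projected[tgt_dep - 1] = (tgt_head, rel)
    return projected


# Toy example (made up): English "Ram eats mangoes" -> Hindi "Ram aam khaata hai".
source_arcs = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "obj")}
alignment = {1: 1, 2: 3, 3: 2}  # Ram->Ram, eats->khaata, mangoes->aam
target_arcs = project_tree(source_arcs, alignment, target_len=4)
print(target_arcs)
```

The unaligned auxiliary "hai" is left without an arc, which shows why projected treebanks usually need a further step to attach leftover tokens. Using the same UD relation labels on both sides is what makes copying the labels meaningful.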
Similar resources
The Effect of Morphological Features on Dependency Parsing of Persian
Data-driven systems can be adapted to different languages and domains easily. Applying this trend to dependency parsing has led to the introduction of data-driven approaches. The existence of appropriate corpora that contain sentences and their associated dependency trees is the only prerequisite for data-driven approaches. Despite obtaining highly accurate results for the dependency parsing task in English langu...
Simple Parser for Indian Languages in a Dependency Framework
This paper is an attempt to show that an intermediary level of analysis is an effective way for carrying out various NLP tasks for linguistically similar languages. We describe a process for developing a simple parser for doing such tasks. This parser uses a grammar driven approach to annotate dependency relations (both inter and intra chunk) at an intermediary level. Ease in identifying a part...
Bidirectional Dependency Parser for Indian Languages
In this paper, we apply a bidirectional dependency parsing algorithm to Indian languages such as Hindi, Bangla and Telugu as part of the NLP Tools Contest, ICON 2010. The parser builds the dependency tree incrementally with two operations, namely proj and non-proj. The complete dependency tree given by the unlabeled parser is used by an SVM (Support Vector Machines) classifier for labeling. ...
Dependency Parsing of Indian Languages with DeSR
DeSR is a statistical transition-based dependency parser that learns from annotated corpora which actions to perform to build parse trees while scanning a sentence. We describe the experiments performed for the ICON 2010 Tools Contest on Indian Dependency Parsing. DeSR was configured to exploit specific features from the Indian treebanks. The submitted run used a stacked combination of fou...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural-language sentences and produces a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline mode. Unfortunately, in pipel...